Statistical Significance Tests are Unnecessary Even When Properly Done and Properly Interpreted: Reply to Commentaries

Author

  • J. Scott Armstrong
Abstract

The three commentators on my paper agree that statistical tests are often improperly used by researchers and that, even when properly used, readers misinterpret them. These points have been well established by empirical studies. However, two of the commentators do not agree with my major point: that significance tests are unnecessary even when properly used and interpreted.

[Postprint version. Published in International Journal of Forecasting, 23 (2), April 2007, 335-336. Publisher URL: http://dx.doi.org/10.1016/j.ijforecast.2007.01.010. Available at ScholarlyCommons: http://repository.upenn.edu/marketing_papers/128]

I am pleased that Paul Goodwin addressed the issue, "What is new?" Like Goodwin, I believe that the major point of my article is new to the vast majority of researchers and practitioners. Few forecasting researchers or practitioners are aware that there is no empirical evidence supporting the use of statistical significance tests. Despite repeated calls for evidence, no one has shown that applications of tests of statistical significance improve decision-making or advance scientific knowledge. Schmidt (1996) offered this challenge: "Can you articulate even one legitimate contribution that significance testing has made (or makes) to the research enterprise (i.e., any way in which it contributes to the development of cumulative scientific knowledge)?" Schmidt and Hunter (1997) reported that no such cases had been put forward, and they repeated Schmidt's challenge.

In their commentaries, Herman Stekler and Keith Ord claim that statistical tests are needed for meta-analyses. Hunter and Schmidt (1996) demonstrate that this is false. They use the following example, which I have paraphrased: Assume that there is a true correlation of .3 between a measure of family composition and juvenile delinquency. What would happen if 50 studies were conducted to test for this relationship and the power of each test was .5 (a typical value in the social sciences)? In this case, half of the studies would conclude that there was a significant relationship at an alpha level of .05, and the other half would conclude that there was no relationship. Thus, by relying on tests of statistical significance, half of the studies would falsely conclude that there was no relationship, even though the true correlation is .3.
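The arithmetic behind the example can be made concrete. With a true correlation of .3, a two-sided test at alpha = .05 has power of about .5 when each study has roughly n = 43 observations, since the expected Fisher-z statistic, arctanh(.3) x sqrt(n - 3), must reach 1.96. The following is a minimal simulation sketch of the scenario, not Hunter and Schmidt's own analysis; the sample size of 43 is my assumption, chosen to match the stated power of .5, and numpy/scipy are assumed to be available.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

TRUE_R = 0.3       # true population correlation (as in the example)
N_STUDIES = 50     # number of independent studies
N_PER_STUDY = 43   # assumed sample size, giving power of roughly .5
ALPHA = 0.05

cov = [[1.0, TRUE_R], [TRUE_R, 1.0]]
significant = 0
for _ in range(N_STUDIES):
    # Each study draws its sample from a bivariate normal with rho = .3
    sample = rng.multivariate_normal([0.0, 0.0], cov, size=N_PER_STUDY)
    r, p = stats.pearsonr(sample[:, 0], sample[:, 1])
    if p < ALPHA:
        significant += 1

print(f"{significant} of {N_STUDIES} studies significant at alpha = {ALPHA}")
# Roughly half the studies come out "significant"; a reader relying on
# significance tests would wrongly see a literature of mixed results.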
Hunter and Schmidt showed that, prior to the development of meta-analysis, social science research was plagued with such faulty analyses, and that this hampered scientific progress. They demonstrated the superiority of meta-analyses that use effect sizes and confidence intervals. Becker (1987) contrasted the effects of using combined significance levels in three meta-analyses; she concluded that, when making inferences, effect-size analysis is superior to using combined significance levels. Schmidt (1996) concluded that the use of meta-analysis will eventually lead researchers to abandon significance tests. This will bring scientific procedures in the social sciences more in line with those in the physical sciences.
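For contrast, here is a minimal sketch of the effect-size approach those authors recommend, applied to the same 50 hypothetical studies: pool the correlations on the Fisher-z scale with a fixed-effect model and report a confidence interval. The per-study sample size of 43 and the normal approximation for Fisher-z estimates are assumptions for illustration; this shows the general technique, not Hunter and Schmidt's specific procedure.

import numpy as np

rng = np.random.default_rng(1)

n = np.full(50, 43)           # assumed sample size per study
z_true = np.arctanh(0.3)      # true effect on the Fisher-z scale
# Observed study effects: Fisher-z estimates are approximately normal
# with standard error 1 / sqrt(n - 3)
z_obs = rng.normal(z_true, 1.0 / np.sqrt(n - 3))

# Fixed-effect pooling: weight each study by the inverse of its variance
w = n - 3
z_pooled = np.sum(w * z_obs) / np.sum(w)
se_pooled = 1.0 / np.sqrt(np.sum(w))
lo, hi = z_pooled - 1.96 * se_pooled, z_pooled + 1.96 * se_pooled

print(f"pooled r = {np.tanh(z_pooled):.2f}, "
      f"95% CI [{np.tanh(lo):.2f}, {np.tanh(hi):.2f}]")
# The pooled estimate lands near the true .3 with a narrow interval,
# recovering the effect that study-by-study significance tests obscured.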
Claims that significance tests are needed for meta-analyses were also refuted by the experience of those involved in summarizing cumulative knowledge for the Principles of Forecasting book (Armstrong, 2001). None of the authors said that they needed significance tests.

Herman Stekler claims that I ignored papers expressing views that were contrary to mine. Schmidt and Hunter (1997) cite a number of such papers. However, I believe that the proper way to address this question is to marshal the evidence rather than to take a vote among experts. I actively sought evidence that was contrary to my position.

Keith Ord's view is consistent with what I now believe to be an outdated principle from Armstrong (2001, p. 717): "13.29 Use statistical significance only to compare the accuracy of reasonable methods. Little is learned by rejecting an unreasonable null hypothesis. . . . [Statistical significance] can be useful, however, in making comparisons of reasonable methods when one has only a small sample of forecasts." No one has publicly challenged this principle, nor offered any evidence against it. However, after conducting a further review of the evidence on statistical significance tests, I concluded that principle 13.29 needed to be revised to state instead that tests of statistical significance should not be used. One should assess effect sizes, and use confidence intervals and replications to assess confidence.

Koning, Franses, Hibon and Stekler (2005) violated principle 13.29 in that they tested an unreasonable null hypothesis (this may be true of many papers published in the International Journal of Forecasting). However, even if their analysis had been consistent with this principle, it would not have been helpful. And it certainly violates my revised version of the principle.

The social sciences have been led astray by significance tests. Scientific research can live without significance testing. Gigerenzer (2000, p. 296) wrote, "Several years ago, I spent a day and a night in a library reading through issues of the Journal of Experimental Psychology from the 1920s and 1930s. This was professionally a most depressing experience, but not because these articles were methodologically mediocre. On the contrary, many of them make today's research pale in comparison with their diversity of methods and statistics."

Some people think there is nothing new under the sun, while others marvel at new insights. The evidence on significance testing provided new insights to me. If indeed there were nothing new about significance tests, Koning et al. (2005) would not have been published. It was new to me that significance tests are harming the development of knowledge. In addition, significance tests take up space in journals. Finally, they violate what I believe to be one of the primary functions of statistics: to aid communication.

References

Armstrong, J. S. (2001). Principles of Forecasting. Boston: Kluwer.

Becker, B. J. (1987). Applying tests of combined significance in meta-analysis. Psychological Bulletin, 102, 164-171.

Gigerenzer, G. (2000). Adaptive Thinking: Rationality in the Real World. Oxford: Oxford University Press.

Hunter, J. E. & Schmidt, F. L. (1996). Cumulative research knowledge and social policy formulation: The critical role of meta-analysis. Psychology, Public Policy, and Law, 2, 324-347.

Koning, A. J., Franses, P. H., Hibon, M. & Stekler, H. O. (2005). The M3 competition: Statistical tests of the results. International Journal of Forecasting, 21, 397-409.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.

Schmidt, F. L. & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In Harlow, L. L., Mulaik, S. A. & Steiger, J. H. (Eds.), What if there were no Significance Tests? London: Lawrence Erlbaum.


